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Abstract. We describe an algorithm for computing Bernoulli numbers. Us- 



record. Our method is to compute B^ modulo p for many small primes p, 

and then reconstruct B^ via the Chinese Remainder Theorem. The asymp- 

»«f-\ ' totic time complexity is 0(k log^^ k), matching that of existing algorithms 

that exploit the relationship between B^ and the Riemann zeta function. Our 
implementation is significantly faster than several existing implementations of 
the zeta-function method. 



1. Introduction 
The Bernoulli numbers So, -Bi, B2, ■ ■ • are rational numbers defined by 



>• ' 

r-»^ ' In this paper we focus on the problem of computing B^, as an exact rational 

^^ , number, for a single large (even) value of k. To date, the most efficient algorithms 

CO ' for computing an isolated Bk have been based on the formula 

!>■ (1) Sfe = (-l)'/'+'^S^, fc > 2 and /fc even, 



(27r)'= 

00 . where ({s) is the Riemann zeta function [IR90[ p. 231]. One first computes a high- 

^7-; ' precision approximation for Bk via ([1]), using the Euler product for ({s). The exact 

denominator of Bk is known by the von Staudt-Clausen theorem. Provided that 
the numerical approximation to Bk has enough correct bits — it turns out that 
O(fclogfc) bits is enough — we may recover the numerator of Bk exactly. We will 
C^ ' refer to this algorithm and its variants as the 'zeta-function algorithm'. 

The zeta-function algorithm has been rediscovered numerous times. The formula 
(P) itself goes back to Euler. Chowla and Hartung |CH72] give an 'exact formula' 
for Bk, the evaluation of which is tantamount to executing the algorithm, but using 
the defining sum for ({k) instead of the Euler product, and using 2(2'^ — 1) in place 
of the exact denominator of Bk ■ Fillebrown |Fil92j mentions [CH72| and points out 
that using the Euler product is asymptotically faster; she gives a detailed time and 
space complexity analysis, and attributes the algorithm to Herbert Wilf (unpub- 
lished). Fee and Plouffe |FP07j state that they discovered and implemented the 
zeta-function algorithm in 1996. Kellner [Kel02j also refers to [CH72] and suggests 
using the Euler product. He mentions some data produced by Fee and Plouffe, 
although not their algorithm. According to Pavlyk [PavOSj . the implementation 



Date: 13th October 2008. 



2 DAVID HARVEY 

in Mathematica 6 [Wol07| derives from Kellner's work. Stein |Ste07|, p. 30] indi- 
cates another independent discovery by Henri Cohen, Karim Belabas, and Bill Daly 
(unpublished) , that led to the implementation in PARI/GP |Par07| . 

In this paper we present a new algorithm for computing Bk that does not de- 
pend on IT]) or any properties of Ci^)- The basic idea is to compute Bk modulo 
p for sufficiently many small primes p, and then reconstruct B^ using the Chinese 
Remainder Theorem. Using a parallel implementation, we have computed Bk for 
k ~ 10^, improving substantially on the previous record of fc = 10^ jPavOSj . 

Let M{n) denote the time (bit operations) required to multiply two n-bit inte- 
gers. We may assume that M{n) — 0(nlog^^^ n), using for example the Schonhage- 
Strassen algorithm [SS71| . According to Fillebrown's analysis |Fil92| p. 441], the 
zeta-function algorithm for computing Bk requires 0{k) multiplications of integers 
with O(fclogfc) bits, so its running time is 0{k^ log ^"^ k). The new algorithm also 
runs in time 0(fc^ log^"*"^ fc). However, we will see that over the feasible range of 
computation the running time behaves more like 0{k^ logfc). 

The new algorithm is easily parallelisable, as the computations for the various 
p are independent. Only the final stages of the modular reconstruction are more 
difficult to execute in parallel, but these account for only 0(fc^+^) of the running 
time. During the modular computations, each thread requires only 0(log k) 
space. The zeta-function algorithm may also be parallelised, for instance by map- 
ping Euler factors to threads, but this requires 0(fc logfc) space per thread. 

2. Computing Bk modulo p 

Throughout this section p> 5 denotes a prime, and fc an even integer satisfying 
2 < k < p—S. Our algorithm for computing Bk modulo p is based on the following 
well-known congruence. Let < c < p, and suppose that c^ 1 (mod p). Then 

k ^^^ 

(2) ij,^^_^^fc-i/^^(^) (modp), 

x — l 

where 

, , (x mod p) — c{x/c mod p) c — 1 
iic[x) :— I - . 

p 2 

Equation ([2]) may be deduced by reading the statement of Theorem 2.3 of [Lan90| 
§2] modulo p (as is done in the proof of Corollary 2 to that theorem). 

Equation ^ may be used to compute Bk modulo p directly, but computing x''~^ 
modulo p for each x is unnecessarily expensive. Instead, we select a generator g of 
the multiplicative group (Z/pZ)^, put c = g, and rewrite the sum as 

h ^~^ 

(3) Bk^^—^Y.9'''^~'^^a^9') (modp). 

Note that g^ ^ 1 (mod p) since p ~ \\k. For a; ^ (mod p) we have hc{—x) = 
— hc{x), and g^P^^^^ = —1 (mod p), so we obtain the half-length sum 

2fc ^-^'/^ 



(4) ^"^r^^ E a'^'^'^^Agl (modp), 

9 1=1 
leading to the following algorithm. 
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Algorithm 1: Compute Bk modulo p 



5 ^- a generator of (Z/pZ)' 
r <— g'^^^ (mod p) 
u^(g- l)/2 (mod p) 



s 


^0, 


X ^l. Y ^r 


f( 

e 


ar i ^ 

X <- 
nd 


1 to (p-l)/2 do 
[9X/p\ 

- {S+ {u-q)Y) modp 

- gX mod p, Y ^- rY mod p 


r 


eturn 


2kS/{l-g'')modp 



Proposition 1. Let p > 5 be a prime, and let k be an even integer in the interval 
2 < k < p ~ 3. Algorithm\T\ computes Bk modulo p in time 0{plog p). 

Proof. At the top of the loop we have X = g*^^ (mod p) and Y = g'C^^^) (mod p). 
The value assigned to q is {g{g^~^ mod p) — {g^ mod p))/p — u— hg{g^). Therefore 
the value added to S is g^'^''~^^hg{g^). By Q the return value is Bk modulo p. 

We now analyse the complexity. A generator of (Z/pZ) ^ may be found determin- 
istically in time 0{p^''^~^^) |Shp96| . In the main loop, the algorithm performs 0{p) 
additions, multiplications and divisions (with remainder) of integers with O(logp) 
bits. The computation of r requires another O(logfc) = O(logp) such operations. 
The division by 1 — g^ requires an extended GCD of integers with O(logp) bits. 
The total cost is O {p log^+^ p) . D 

Remark. To compute Biqs (see SJ5]), the largest prime needed is 1558322063. Even 
on a 32-bit machine, this fits into a single machine word, and the cost of arithmetic 
modulo p does not depend on p. Therefore, over the feasible range of computation, 
the complexity of Algorithm [1] may be regarded as only 0{p). 

3. Computing Bk as a rational number 

In this section we present a multimodular algorithm for computing Bk — Nk/Dk 
as an exact rational number. Before getting to the algorithm proper, we make some 
preliminary calculations. In what follows, we assume that fc > 4 and that k is even. 

The denominator Dk is given by the von Staudt-Clausen theorem [IR90[ p. 233]: 



Dk^ n 



p pnnic 

p-llfc 



Note that Dk < 2^+\ since Dk divides 2(2*= - 1) |(:;H721 p. 114]. 

The size of the numerator Nk may be bounded quite tightly as follows. From 
^ and Stirling's approximation [Lan97| p. 120] wc have 

log \Bk\= log2 + logC(fc) + logfc! - fclog27r 

< fk + i j log fc - (1 + log 27r)fc + (log 2 + log C(fc) + log ^2^) + ^. 

Since ({k) < ({A) and 12fc > 48, a short calculation shows that \Nk\ < 2^ where 
(3 := \{k + 0.5) log2 k - 4.094A: + 2.470 + loga Dk] . 
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For any X > 0, let AIx ■— Ylp<xp-UkP- ^^^ strategy will be to select X so 
that Mx > 2^, and then compute Nk modulo Mx- This ensures that we have 
sufficient information to reconstruct Nk- In the algorithm itself our choice of X 
will be quite sharp, but we also need a rough a priori estimate. We will take 
Y := max(37, [(A; + 0.5) logj kj). To check that this works, we will use the estimate 
Sn<a; logP ^ O.TSx, 3, wcak form of the prime number theorem, valid for x > 37 
[NarOOt p. 111]. Then we have 

log2 My ^ ~ log2 £>*: + X! 1°S2 V 

p<Y 

., ,x 0.73,, 

^-(^ + 1) + !^^ 

>Y -k-l 

>{k + 0.5) log2 k-k-1 



>{k + 0.5) log2 k - 


- 3.094A: + 6.470 (since 


i k>4:) 


>{k + 0.5) log2 k - 


- 4.094A: + 3.470 + log2 . 


Dk + 2 


> r(fc+o.5)iog2fc 


- 4.094fc + 2.470 + log2 


Dk\+2 



= /3 + 2, 

so that My > 2^+^. Algorithm [21 presents the resulting algorithm; several impor- 
tant details are given in the proof of the following theorem. 

Algorithm 2: Compute Bk as a rational number {k > A, k even) 

// compute Dk and preliminary bounds 
Y ^ max(37, \{k + 0.5) log2 k'] ) 
Compute list of all primes p <Y 

Dk^l\p-i\kP 

f3^\{k + 0.5) log2 k ~ 4.094fc + 2.470 + log2 Dk\ 

II compute tight bound X 
p^3. A/' ^1.0 
while M' < 2'3+i do 

p <— NextPrime(p) 

if p-l\k then M' ^ pM' 
end 
X^p 

II collect modular information 

for p e {2,3,5, .. . ,X},p— 1 ] k do rp ^- Bk (mod p) 

II reconstruction 

M^Mx^Up<x,p-MkP 

R ^- unique integer in [0, M) congruent to r^ modulo p for all primes p \ M 

N' ^ DkR (mod M) 

if fc = 2 (mod 4) then Nk ^ N' else Nk ^ N' - M 

return Nk/Dk 

Theorem 2. Algorithmic computes Bk in time 0{k^ log ^^ k). 
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Proof. Computing all primes less than Y requires time 0(fc^+^), via an elemen- 
tary sieving algorithm, since Y = O(fclogfc). In what follows we assume that all 
primality tests and enumeration of primes are performed with the aid of this list. 

To compute Dk , make a list of the factors of k (for example, by testing divisibility 
of k by each integer in [2, Vk]), and for each factor d\k check whether d+ 1 is prime. 
Multiply together those that are prime using a product tree [vzGGOSl Algorithm 
10.3, p. 293]. Since logDfc = 0(fc), this takes time 0{k^+^). 

The goal of the while- loop is to quickly find a bound X, much tighter than Y, 
such that Mx > 2^^ . The variable M' holds an approximation to the product of the 
primes encountered so far. It should be represented in floating-point, with at least 
n — [log2 y] + 1 bits in the mantissa, so that each multiplication by p magnifies 
the relative error by at most 1 -I- 2~". We claim that the loop terminates before 
exhausting the precomputed list of primes. Indeed, let s be the number of p such 
that p <Y and p— l\k. After s iterations, M' approximates My with relative error 
at most (1 + 2-")" < (1-^(2^)-!)'^ < e^/^; that is, M' > e'^^^My > e-^^^2^+^ > 
2/3-1-1 _ xhis is impossible since the loop termination condition would already have 
been met earlier. This argument also shows that the selected bound X satisfies 
Mx > e-^/22''+i > 2f^. Since Y = O(fclogfc), computing X takes time 0(fci+^). 

In the for-loop, for each p we use Algorithm [1] to compute Vp := Bk (mod p). 
Note that Algorithm [1] requires that 2 < fc < p — 3. To satisfy this requirement, 
we put m = k mod {p — 1), compute Bm (mod p) using Algorithm [1] and then 
recover Bk (mod p) via Rummer's congruence B^/k = Bm/ra (mod p) jLan90( 
p. 41]. The total number of primes processed is no more than 7r(y) = 0{Y/\ogY) = 
0{k log k/ log(fc log k)) = 0{k). According to Proposition[Tl the cost for each prime 
is 0{p\og^'^'^ p) — 0{k\og^^^ k), since p <Y — O(fclogfc). Therefore the total cost 
of the while-loop is 0{k^ log^^^ k). 

We compute M and R simultaneously using fast Chinese Remaindering [vzGG03l 
Theorem 10.25, p. 302]; the cost is 0((logMx)^+^) = 0((fclogfc)i+^) = 0{k^+^). 

In the last few lines we recover Nk- By construction we have R = Bk (mod M), 
so TV' = Nk (mod M). Note that \Nk\ < 2^^ < M. If fc = 2 (mod 4) then Nk is 
positive so we simply have Nk = N'] otherwise Nk is negative, and we must correct 
N' by subtracting M. D 

Remark. During the modular collection phase, we skipped those p for which p— 1 \ k. 
An alternative would be to use the fact that pBk = — 1 (mod p) for these p jIR90[ 
p. 233], and to apply the multimodular algorithm directly to Nk rather than to Bk. 
This would require the additional minor overhead of computing Dk (mod p) for all 
of the other primes. 

Remark. As mentioned below the proof of Proposition [l] the running time of Al- 
gorithm [1] behaves like 0{p) in practice. The corresponding bound for the running 
time of Algorithm [2] is 0{k^ logfc). 

4. Implementation techniques for computing Bk modulo p 

In this section we sketch several techniques that yield a large constant factor 
improvement in the speed of computing Bk modulo p, for almost all k and p. In 
experiments we have found that these techniques yield an increase in speed by a 
factor in excess of 150 compared to an efficient implementation of Algorithm [T] 
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We first perform some preparatory algebra. Consider again the congruence ([2|). 
Let n be the multiplicative order of c in {'L/p'L)^ ^ and put m = {p — !)/«. Let 
5 be a generator of (Z/pZ)^. Then {g*c~-'}o<i<m.o<j<n forms a complete set of 
representatives for {Z/pZ)^ . Putting x — g^c^^ , we obtain 

(5) Bk^^^ Y. (a'-'Yic'-T'hAg'c-n (modp). 

< i < m 
0<j<n 

Considerations similar to those of ^ allow us to halve the number of terms, as 
follows. First suppose that n is even. Since hc{—x) = —hc{x) and c"/^ = —1 
(mod p), the double sum in ([5]) becomes 

{)<~i<m 0<.i<m 

0<j<n/2 0<j<n/2 

^2 Y {g'-'nc'^-'r'hcig'c-^) (niodp). 

0<j<n/2 

Now suppose that n is odd. Then m is even, and we have (— (;™/2)" = (^—ly^gip-'^)/^ = 
(— 1)(— 1) = 1 (modp), so that —g"^/^ = c^ (mod p) for some < A < n. The 
double sum in ([5|) becomes 

0<i<m/2 0<i<m/2 

0<j<n 0<j<n 

^ E (ff''')*(c'-')-^>ic(ffV-^') - E (/"')Xc'-^)--''+'/ic(-ffV-^"+^) 

0<i<m/2 0<i<m/2 

0<j<n 0<j<n 

^2 E (/"'r(c'"')"^/ic(5'c~^) (modp). 

0<i<m/2 
0<j<n 

We combine both cases into a single statement by writing 

9k 

(6) B^^Y^^ E (5'"')' E ic'-')-'h.{9'c-n (modp), 

0<i<?n' 0<j<n' 



where 



, I n/2 n even, , (p — l)/2 

n = < m = ; . 

n n odd, n' 



The sum in ([6]) may be evaluated by an algorithm similar to Algorithm [1] with 
an outer loop over i and an inner loop over j; we omit the details. The inner loop 
is executed n'm' = {p — l)/2 times, the same number of iterations as in the main 
loop of Algorithm [TJ Note that if c is chosen 'randomly', then we expect n' to be 
quite large, so it is imperative to make the inner loop as efficient as possible. 

To this end, we specialise to the case c = 1/2 (mod p); this will allow us to 
exploit the fact that multiplication by 2 modulo p is very cheap. Of course, this 
specialisation does not work if 2*^ = 1 (mod p). In practice, such primes tend to be 
rare for any given k, and we may fall back on Algorithm [1] to handle them. 
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Directly from the definition of hc{x), we fiave hi/2{x) = — /(x)/4 where 

,, . fl < {xmodp) <p/2, 

j[x) :— < 

1—1 p/2 < {x mod p) < p. 

From ^ we obtain 

(7) ^'^^ 2(2-''^-l) ^ (^'"')' ^ (2'=-i)V(5''2^) (modp), 

where n is the order of 2 in (Z/pZ) ^ , and n' and to' are defined as before. 

In the algorithm that evaluates ([7]), the inner loop is considerably simpler than 
the main loop of Algorithm[T] Given g^2^ at the beginning of the loop, we compute 
the next term ^'2^+^ by doubling it and subtracting p if necessary, and f{g^2^) is 
— 1 or +1 according to whether the subtraction occurred. Therefore only a single 
modular multiplication is required on each iteration, to keep track of (2''^^y . 

Next we describe further optimisations for computing a sum of the form 

(8) Y^ rV(2^s) (modp). 

0<j<N 

(For evaluating ([7]), we take N — n' , r — 2*^^^, s — cf .) The following observation 
is crucial: if < s < p, and if the binary expansion of s/p is 0. 60^1^2 • ■ • (where 
each hj is or 1), then f{2^s) = (— 1)''^. Indeed, the binary expansion of 2^ s/p is 
obtained by shifting that of s/p to the left by j digits, so 

(2-'s mod p)/p — O.bjbj+i . . . , 

and then 

f{2^s) = +1 ^F=^ < (2^s modp) < p/2 ^=^ bj = 0. 
Assume that we have available a routine for computing the binary expansion of s/p, 
whose output is a sequence of base-2™ digits. Strip off the last N mod m terms of 
(ISl) and compute them separately; we may then assume that N is divisible by to,. 
Put j = mt + u, so that ^ becomes 

(9) J2 (^''")* Yl '■"/(2'"*+"s)- 

0<t<A'/m 0<u<m 

We may now use a caching strategy as follows. Maintain a table {Ta} of length 2™, 
where a G {+1,-1}™. Each To- is a residue modulo p, and is initialised to zero. 
For each < < < N/m, add (r™)* to the table element T^-, where a is determined 
from the i-th base-2'" digit of s/p by 

a - (/(2™*s), /(2"*+is), . . . , /(2"*+™-is)). 

(In practice, a is represented by the sequence of bits themselves, and is obtained 
by a simple bit-mask.) After finishing the loop over t, compute the 2™ values 
Va = X]o<u<m ^"'''"' where a = {ao, . . . ,am-i)- Then the desired sum is given 
by ^^ VfjT„. The main benefit of this approach is that in the inner loop we only 
perform N/m modular multiplications, instead of N . The tradeoff is that we must 
maintain the table {T„}, and some extra work is required at the end to assemble the 
information from the table. For this to be worthwhile, we should have iV ^ 2™. We 
should also choose to so that the base-2'" digits of s/p can be extracted efficiently. 
Moreover, memory locality problems can offset the savings in arithmetic if m is too 
large. We have found in practice that m — 8 works quite well. 
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In the implementation discussed in ^ we made several further optimisations, 
which we describe only briefly. 

For better memory locality — and indeed less total memory consumption — we 
split ([9]) into blocks and process each block separately. This avoids needing to store 
and then later retrieve too many bits of the expansion of s/p. Instead of comput- 
ing s/p for each s separately, we precompute an approximation to 1/p, and then 
multiply it by each s that occurs, taking advantage of the speed of multiplication 
by a single-word integer compared to the corresponding division. 

We use several tables instead of just one. For example, on a 32-bit machine, we 
use four tables: the highest 8 bits of each word of s/p contribute to the first table, 
the next 8 bits to the second table, and so on. This reduces further the number of 
modular multiplications. When adding values into a table, we skip the reduction 
modulo p, delaying this until the end of the loop. Of course, this is only possible 
when the word size is large enough compared to p, otherwise overflow may occur. 
Finally, instead of re-initialising the table on each iteration of the outer loop, we 
simply continue to accumulate data into the same table. This partly mitigates 
against the overhead incurred when m' is unusually large compared to n' . 

Most of the modular arithmetic uses Montgomery's reduction algorithm [Mon85j . 

5. Performance data 

The author implemented the above algorithms in a CH — h package named bernmm 
(Bernoulli multi- modular). The source code is released under the GNU General 
Public License (GPL), and is freely available from the author's web page. It depends 
on the GMP library |Gra08j for arbitrary precision arithmetic, and on NTL |Sho07| 
for some of the single-precision modular arithmetic. It is included as a standard 
component of the Sage computer algebra system (version 3.1.2) [SJ05| . accessible 
via the bernoulli function. 

The parallel version distributes the modular computations among the requested 
number of threads, and depends on the standard pthreads API. For the reconstruc- 
tion phase, the lowest layers of the subproduct tree are performed in parallel, but 
the number of threads decreases towards the top of the tree, since the extended 
GCD routine is not threaded. 

Table [T] shows timing data for 10'* < fc < 10®, spaced by intervals of \/IO, for 
bernmm and several existing implementations of the zeta-function algorithm. The 
timings were obtained on a 16-core 2.6GHz AMD Opteron (64-bit) machine with 
96 GB RAM, running Ubuntu Linux. The last two rows of the table (fc = [10^-^J 
and k = 10^) improve on the prior record fc = lO'' set by Pavlyk |Pav08| . The 
values -B31622776 and Biqs. may be downloaded from the author's web page. 

The compiler used was gcc 4.1.3. We used the version of GMP shipped with Sage 
3.1.2, which is the same as GMP 4.2.2, but also includes assembly code written by 
Pierrick Gaudry that improves performance on the Opteron, and code by Niels 
MoUer that implements a quasi-linear time extended GCD algorithm |Mol08| . We 
used NTL version 5.4.1, also included in Sage. 

The timings for PARI/GP were obtained for version 2.3.3, using the same patched 
version of GMP. We used the bernf rac function from the GP interpreter. The 
timings for calcbn, a specialised program for computing Bernoulli numbers, were 
obtained from calcbn 2.0 [Kel08| . again using the same patched GMP. The timings 
for Mathematica were obtained from the Linux x86 (64-bit) version of Mathematica 
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PARI 


MMA 


calcbn 


bernmm 






CPU 


CPU 


CPU 


wall 


CPU 


wall 


k 


threads 


time 


time 


time 


time 


tim.e 


time 


10^ 


1 


0.07 


0.18 


0.04 




0.25 




[lO^-Sj 


1 


0.68 


1.80 


0.49 




1.02 




10^ 


1 


7.45 


21.7 


5.72 




5.46 




[10^-5] 


1 


88.3 


285 


82.2 




38.6 






2 






82.3 


41.6 


38.6 


20.4 


10<^ 


1 


1058 


3434 


935 




340 






4 






967 


246 


341 


96.1 


[106-5] 


1 


1.24-104 


4.25- 10^ 






3397 






4 






1.33-10'' 


3385 


3406 


909 


10^ 


1 


1.42-105 


5.10-10^ 












10 






1.60- 10^ 


1.61 -10* 


3.59-10'* 


3938 


[lO^-SJ 


10 










3.86-105 


4.03-104 


10* 


10 










4.19-106 


4.27-105 



Table 1. Time (in seconds) for computing Bk for PARI/GP, 
Mathematica, calcbn and bernmm. 



6.0, using the BernoulliB function. Neither PARI/GP nor Mathematica supply a 
parallel implementation as far as we are aware. 

The peak memory usage for computing Biqs was approximately 5 GB; this oc- 
curred during the final extended GCD operation. By comparison, PARI/GP used 
4.3 GB for k = 10^, and calcbn used 1.7 GB for k = lO"^ (with ten threads). 

For each implementation, we observe that the ratio of the times in adjacent rows 
of the table is approximately 10, reflecting the predicted quasi-quadratic asymptotic 
running time of both the zeta-function algorithm and the multimodular algorithm. 

We checked our results by two methods. First, we checked that as a real number, 
the proposed value for Bf^ agrees with the estimate |i?j.| w 2(27r)~'^fc!. Since we 
computed Bk by modular information alone, this is a very strong consistency check. 
Second, we reduced modulo p the proposed value for B^, and compared this against 
the output of Algorithm [1] for a few primes distinct from those primes that we used 
to compute Bk- Again, this is a very strong consistency check — stronger in fact, 
since it involves all of the bits of the proposed Bk- 

Mathematica, PARI/GP and calcbn use GMP internally, so naturally any im- 
provement in GMP's large integer arithmetic will improve the timings reported in 
the table. In this connection we mention the improvements in multiplication speed 
reported by |GKZ07J : their paper also gives comparisons with other implementa- 
tions of large-integer arithmetic. 
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